Multitenant migrations execution #431
Conversation
@@ -1,24 +1,36 @@
FROM node:18-alpine
# Base stage for shared environment setup
Improved the Dockerfile to reuse the node_modules cache at each build step.
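A minimal sketch of the layer-caching pattern this refers to (stage names and commands are illustrative, not the exact contents of this PR's Dockerfile):

```dockerfile
# Sketch only: copy the dependency manifests before the source so the
# `npm ci` layer is reused until package.json / package-lock.json change.
FROM node:18-alpine AS base
WORKDIR /app
COPY package.json package-lock.json ./
RUN npm ci

FROM base AS build
# Source changes invalidate only the layers below this line;
# the cached node_modules layer above is reused.
COPY . .
RUN npm run build
```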
#######################################
# MULTI_TENANT=true
DATABASE_MULTITENANT_URL=postgresql://postgres:postgres@127.0.0.1:5433/postgres
-REQUEST_X_FORWARDED_HOST_REGEXP=
+REQUEST_X_FORWARDED_HOST_REGEXP=^([a-z]{20}).local.(?:com|dev)$
Default value for the multitenant X-Forwarded-Host regexp, matching `{tenant-id}.local.dev`, to make it easy to run Storage as multitenant out of the box.
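For illustration, a quick sketch of how the default pattern maps a forwarded host to a tenant id (the 20-character id and hostname are made up):

```typescript
// Same pattern as the env default; note the dots are unescaped there
// too, so they match any character as well as a literal dot.
const pattern = /^([a-z]{20}).local.(?:com|dev)$/

const host = 'abcdefghijklmnopqrst.local.dev' // hypothetical tenant host
const tenantId = host.match(pattern)?.[1]
console.log(tenantId) // 'abcdefghijklmnopqrst'
```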
@@ -85,7 +78,7 @@ STORAGE_BACKEND=s3
#######################################
# S3 Backend
#######################################
-STORAGE_S3_BUCKET=name-of-your-s3-bucket
+STORAGE_S3_BUCKET=supa-storage-bucket
Default value for the bucket name, matching the bucket we create in the docker-compose setup. This makes it easy to simply rename `.env.sample` to `.env` without any additional changes.
@@ -23,9 +23,10 @@ A scalable, light-weight object storage service.

- Copy `.env.sample` to `.env` file.
- Copy `.env.test.sample` to `.env.test`.
- Change `GLOBAL_S3_BUCKET` and `REGION` to the name and region of a S3 bucket.
Now simply rename `.env.sample` to `.env` and that's it: all default values will work out of the box.
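In other words, local setup reduces to copying the sample files (standard shell, nothing PR-specific):

```bash
cp .env.sample .env
cp .env.test.sample .env.test
```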
@@ -258,8 +261,9 @@ export function getConfig(options?: { reload?: boolean }): StorageConfigType {
  logflareEnabled: getOptionalConfigFromEnv('LOGFLARE_ENABLED') === 'true',
  logflareApiKey: getOptionalConfigFromEnv('LOGFLARE_API_KEY'),
  logflareSourceToken: getOptionalConfigFromEnv('LOGFLARE_SOURCE_TOKEN'),
-  defaultMetricsEnabled:
-    getOptionalConfigFromEnv('DEFAULT_METRICS_ENABLED', 'ENABLE_DEFAULT_METRICS') === 'true',
+  defaultMetricsEnabled: !(
Default metrics are now enabled by default.
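The diff is truncated at the opening `!(`; presumably the condition is inverted so metrics stay on unless explicitly disabled. A hedged guess at the completed expression inside `getConfig()`:

```typescript
// Assumption: enabled unless the env var is explicitly set to 'false'.
defaultMetricsEnabled: !(
  getOptionalConfigFromEnv('DEFAULT_METRICS_ENABLED', 'ENABLE_DEFAULT_METRICS') === 'false'
),
```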
@@ -346,7 +396,7 @@ async function refreshMigrationPosition(
if (shouldUpdateMigrations) {
  await client.query(`BEGIN`)
  try {
-    await client.query(`DELETE FROM ${migrationTableName}`)
+    await client.query(`DELETE FROM ${migrationTableName} WHERE id is not NULL`)
Running deletes without a WHERE clause seems to be blocked by default on staging/prod by a pg extension, hence the trivially satisfied `WHERE id is not NULL` predicate.
src/http/routes/tenant/migrations.ts
if (!pgQueueEnable) {
  return reply.code(400).send({ message: 'Queue is not enabled' })
}
const queueSize = await Queue.getInstance().getQueueSize(RunMigrationsOnTenants.getQueueName())
Admin route to see how many tenant migrations are still pending: `GET /admin/migrations`.
Payload:
{
  "remaining": 10
}
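A hedged reconstruction of the endpoint around the snippet above (the guard and queue-size call come from the diff; the route registration details are assumptions):

```typescript
fastify.get('/admin/migrations', async (request, reply) => {
  if (!pgQueueEnable) {
    return reply.code(400).send({ message: 'Queue is not enabled' })
  }

  // Remaining migration jobs still sitting in the queue.
  const queueSize = await Queue.getInstance().getQueueSize(
    RunMigrationsOnTenants.getQueueName()
  )
  return reply.send({ remaining: queueSize })
})
```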
const is4xxResponse = statusOfType(log.res.statusCode, 400)
const is5xxResponse = statusOfType(log.res.statusCode, 500)

const logLevel = is4xxResponse ? 'warn' : is5xxResponse ? 'error' : 'info'
Thanks to @egor-romanov for pointing out that request logs were always logged at info. This changes the log level:
- anything 4xx is a warning
- anything 5xx is an error
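A self-contained sketch of the mapping, assuming `statusOfType` checks the hundreds bucket of the status code (its real implementation isn't shown here):

```typescript
// Assumed semantics: is the status within [bucket, bucket + 100)?
const statusOfType = (statusCode: number, bucket: number): boolean =>
  statusCode >= bucket && statusCode < bucket + 100

const logLevelFor = (statusCode: number): 'warn' | 'error' | 'info' => {
  const is4xx = statusOfType(statusCode, 400)
  const is5xx = statusOfType(statusCode, 500)
  return is4xx ? 'warn' : is5xx ? 'error' : 'info'
}

console.log(logLevelFor(404)) // 'warn'
console.log(logLevelFor(503)) // 'error'
console.log(logLevelFor(200)) // 'info'
```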
  return {}
}

static batchSend<T extends BaseEvent<any>[]>(messages: T) {
Implemented batch send so that multiple jobs are inserted with one insert statement.
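Roughly what this enables, sketched directly against pg-boss (the `singletonKey` terminology in this PR suggests a pg-boss-backed queue; the queue name here is hypothetical and `BaseEvent.batchSend` presumably wraps something like this):

```typescript
import PgBoss from 'pg-boss'

// One round trip for N jobs instead of N individual sends.
async function enqueueMigrations(boss: PgBoss, tenantIds: string[]) {
  await boss.insert(
    tenantIds.map((tenantId) => ({
      name: 'tenants-migrations', // hypothetical queue name
      data: { tenantId },
      singletonKey: tenantId, // at most one queued job per tenant
      retryLimit: 3, // migrations are retried 3 times (per the PR notes)
    }))
  )
}
```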
  tenantId: string
}

export class RunMigrationsOnTenants extends BaseEvent<RunMigrationsPayload> {
Job that handles tenant migrations
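A guess at the event's overall shape, inferred from the snippet above and the PR description (everything beyond the payload type and `getQueueName` is an assumption):

```typescript
// Sketch, not the actual implementation.
interface RunMigrationsPayload {
  tenantId: string
}

export class RunMigrationsOnTenants extends BaseEvent<RunMigrationsPayload> {
  static getQueueName() {
    return 'tenants-migrations' // hypothetical name
  }

  // Presumably the worker-side handler: run any pending migrations
  // on the tenant's own database.
  static async handle(job: { data: RunMigrationsPayload }) {
    await runMigrationsOnTenant(job.data.tenantId) // hypothetical helper
  }
}
```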
src/server.ts
@@ -28,6 +33,20 @@ const exposeDocs = true
if (isMultitenant) {
  await runMultitenantMigrations()
  await listenForTenantUpdate(PubSub)

+  runMigrationsOnAllTenants()
On server start-up, all Storage API instances race to acquire the lock.
The one that acquires the lock creates the jobs for running the migrations; the instances that fail to acquire it simply return and do nothing.
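A condensed sketch of that race, reusing the `pg_try_advisory_lock` call visible further down in this diff (the tenant-selection helper is a stand-in):

```typescript
// Every instance runs this at startup; only one wins the lock.
async function runMigrationsOnAllTenants() {
  const result = await multitenantKnex.raw(`SELECT pg_try_advisory_lock(?);`, [
    '-8575985245963000605', // shared lock id, same on every instance
  ])

  if (!result.rows[0].pg_try_advisory_lock) {
    return // lost the race: another instance is scheduling the jobs
  }

  // Winner: enqueue a RunMigrationsOnTenants job for each tenant that
  // is behind the latest migration version (hypothetical helper).
  await enqueueMigrationsForOutdatedTenants()
}
```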
  return
}
const result = await multitenantKnex.raw(`SELECT pg_try_advisory_lock(?);`, [
  '-8575985245963000605',
What does this number represent?
The lock id. `pg_try_advisory_lock` takes an arbitrary signed 64-bit key; any constant works, as long as every instance uses the same value to compete for the same lock.
🎉 This PR is included in version 0.48.0 🎉 The release is available on GitHub. Your semantic-release bot 📦🚀
What kind of change does this PR introduce?
Feature
What is the current behavior?
Currently, migrations on tenants are run on the first request after the server starts.
This approach has a few drawbacks:
What is the new behavior?
This PR introduces 3 migration strategy modes:
Migrations are now tracked in the multi-tenant database with `migrations_version` and `migrations_status`. In all modes, migrations are run only if a tenant is not on the latest version.
On Request
The `on_request` mode runs migrations on a tenant database during a request, blocking the request lifecycle until they complete.
If a migration fails, the request also fails with a 500 error.
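As a sketch, this is roughly an "ensure migrated, else 500" hook in the request path (Fastify-style; the hook wiring and helper name are assumptions):

```typescript
// Runs before each tenant request in on_request mode.
fastify.addHook('preHandler', async (request, reply) => {
  try {
    // Hypothetical helper: no-op if the tenant is already on the
    // latest migrations_version.
    await runMigrationsIfNeeded(request.tenantId)
  } catch (err) {
    // A failed migration fails the request itself.
    return reply.code(500).send({ message: 'Tenant migration failed' })
  }
})
```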
Full Fleet
This mode offers a background scheduler to run migrations across all tenants asynchronously.
On server start, the instances race to acquire an advisory lock.
The instance that acquires the lock is responsible for selecting all tenants that need a migration to run.
We emit a `RunMigrationsOnTenant` job for each tenant in a batch, using a singleton key to make sure only one job for a given tenant can exist at any one time. Each Storage instance can consume the events and run the migrations on each single tenant database.
Migrations are retried automatically 3 times.
Failures don't impact the request lifecycle, and the migrations will run on all tenants that are not on the latest version.
Progressive
This mode is a hybrid of the two above. It works similarly to the `full_fleet` mode, since migrations are run by an async job, but the tenants are selected during a request.
During a tenant request, we determine whether the tenant needs a migration to run. If it does, we register the tenantId in an in-memory list.
A background worker pops a batch of 200 tenants from this list and schedules `RunMigrationsOnTenant` jobs every 1m. Deduping between instances is done via the `singletonKey` and early checks.
In case of failure, migrations are retried 3 times, after which the status becomes `FAILED_STALE`.
When using progressive migrations and the status is `FAILED_STALE`, we keep running the migrations during the request lifecycle, without blocking the request for this tenant, in the hope that it recovers from unexpected prolonged issues with the underlying database. This flow is sketched below.
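A hedged sketch of the progressive flow; the 200-tenant batch size and the 1-minute interval come from the description above, everything else (names, queue wiring) is an assumption:

```typescript
// Request path: remember tenants that are behind the latest version.
const pendingTenants = new Set<string>()

export function markTenantForMigration(tenantId: string) {
  pendingTenants.add(tenantId)
}

// Background worker: every minute, pop up to 200 tenants and enqueue
// jobs; singletonKey-based deduping makes duplicate sends harmless.
setInterval(async () => {
  const batch = [...pendingTenants].slice(0, 200)
  batch.forEach((t) => pendingTenants.delete(t))
  if (batch.length === 0) return

  await RunMigrationsOnTenants.batchSend(
    batch.map((tenantId) => new RunMigrationsOnTenants({ tenantId }))
  )
}, 60_000)
```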
Additional context
With this change, we will see the following benefits: